307 watcher incorrect handling of failed job #308

Merged
mbthornton-lbl merged 15 commits into main from 307-watcher-incorrect-handling-of-failed-job
Dec 3, 2024

Conversation

@mbthornton-lbl (Contributor) commented Nov 26, 2024

This PR provides updates and tests to ensure that a job from the state file with fewer than MAX_FAILURES failures is picked up and re-submitted.

Changes:

  • Update get_finished_jobs to check and update job status for unfinished jobs
  • Unit test to verify correct behavior
  • Unit test to verify that 1x failed job gets re-submitted
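
To make the intent of the changes above easier to review, here is a minimal sketch of the behavior they describe. It is not the code from `watch_nmdc.py`: the `Job` class, the `poll_status` and `submit` callables, `process_failed_job`, and the value of `MAX_FAILURES` are all hypothetical stand-ins, and only the name `get_finished_jobs` comes from this PR.

```python
from dataclasses import dataclass

# Assumed value for illustration only; the project defines the real threshold.
MAX_FAILURES = 3


@dataclass
class Job:
    """Hypothetical stand-in for a job restored from the state file."""
    job_id: str
    last_status: str
    failed_count: int = 0
    done: bool = False


def get_finished_jobs(jobs, poll_status):
    """Refresh the status of every unfinished job, then sort the results.

    `poll_status` is an assumed callable that returns the current status
    string for a job (for example "Succeeded", "Failed", or "Running").
    """
    successful, failed = [], []
    for job in jobs:
        if job.done:
            continue  # already handled in an earlier cycle
        job.last_status = poll_status(job)
        if job.last_status == "Succeeded":
            job.done = True
            successful.append(job)
        elif job.last_status == "Failed":
            failed.append(job)
    return successful, failed


def process_failed_job(job, submit):
    """Re-submit a failed job while it is below MAX_FAILURES, else give up.

    `submit` is an assumed callable that returns the new submission id.
    """
    job.failed_count += 1
    if job.failed_count < MAX_FAILURES:
        return submit(job)
    job.done = True  # too many failures: stop retrying
    return None
```

In this sketch, a job loaded from the state file with a last status of Failed is not marked done, so the next polling cycle checks it again and hands it to the retry path.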

Also update .coveragerc to exclude the obsolete re_iding package, and update the badges and .xml, which the CI build is not updating automatically.

Log from a run on Perlmutter in dev, showing pickup and re-submission of a 1x failed job:

2024-11-26 13:04:39,444 INFO: Initializing Watcher: config file: /global/homes/n/nmdcda/nmdc_automation/dev/site_configuration_nersc.toml
2024-11-26 13:04:39,445 INFO: Using state file from config: /global/cfs/cdirs/m3408/var/dev/agent.state
2024-11-26 13:04:39,445 INFO: New Job from State: nmdc:wfmag-11-g7msr323.1, nmdc:66cf64b6-7462-11ef-8b84-deaa01ab0f49
2024-11-26 13:04:39,445 INFO: Last Status: Succeeded
2024-11-26 13:04:39,445 INFO: New Job from State: nmdc:wfmag-12-h52r0792.1, nmdc:c2b7c884-ab78-11ef-8298-3e652b5abb3d
2024-11-26 13:04:39,445 INFO: Last Status: Failed
2024-11-26 13:04:39,445 INFO: Adding 2 new jobs from state file.
2024-11-26 13:04:40,575 INFO: Entering polling loop
2024-11-26 13:04:41,418 INFO: Found 0 unclaimed jobs.
2024-11-26 13:04:41,420 INFO: Checking for finished jobs.
2024-11-26 13:04:41,420 INFO: Found 1 failed jobs.
2024-11-26 13:04:41,420 INFO: Processing failed job: nmdc:sys0a5361811, nmdc:wfmag-12-h52r0792.1
2024-11-26 13:04:41,422 ERROR: Job nmdc:sys0a5361811 failed 2 times. Retrying.
2024-11-26 13:04:43,137 INFO: Submitted job 2c0212ad-83d3-4cc5-85cc-8e6434cb21f8
2024-11-26 13:04:43,137 INFO: Job 2c0212ad-83d3-4cc5-85cc-8e6434cb21f8 submitted

@mbthornton-lbl mbthornton-lbl linked an issue Nov 26, 2024 that may be closed by this pull request
@mbthornton-lbl mbthornton-lbl marked this pull request as draft November 26, 2024 22:22
@mbthornton-lbl mbthornton-lbl marked this pull request as ready for review November 27, 2024 16:53

@shreddd shreddd left a comment

Made some minor comments. Probably simpler to do a walkthrough or summary of the intent of this code update to clarify what I should be looking at for review.

Two review comments on nmdc_automation/workflow_automation/watch_nmdc.py (outdated, resolved)
@mbthornton-lbl mbthornton-lbl merged commit 270afce into main Dec 3, 2024
1 check passed
@mbthornton-lbl mbthornton-lbl deleted the 307-watcher-incorrect-handling-of-failed-job branch December 3, 2024 23:40
Development

Successfully merging this pull request may close these issues.

Watcher incorrect handling of failed job
3 participants